Claude Opus AI News List | Blockchain.News

List of AI News about Claude Opus

Time Details
2026-06-10
17:27
Claude Opus 4.8 vs Fable 5 Chess App Build-Off

According to @godofprompt, a one-shot prompt pitted Claude Opus 4.8 against Fable 5 on building a full chess app with rules, AI levels, and animations.

Source
2026-05-30
21:32
OpenClaw Update Adds Claude Opus 4.8, Krea

According to @openclaw, the 2026.5.28 release adds Claude Opus 4.8, Krea via Fal, faster Gateway and plugins, plus Discord draft commentary.

Source
2026-05-29
12:15
Claude Opus 4.8 Boosts Code Quality 4x

According to @godofprompt, Anthropic says Opus 4.8 is 4x less likely to ship flawed code than 4.7, highlighting prompt engineering’s impact on reliability.

Source
2026-05-29
10:16
Gemini 3.5 Flash Debuts, Atlas Adds Moves

According to @AINewsOfficial_ on X, a roundup highlights Gemini 3.5 Flash, Atlas updates, an AGIBot X2 feat, and Claude Opus 4.8 performance claims.

Source
2026-05-28
20:40
Claude Opus 4.8 Drafts Paper, GPT-5.5 Reviews

According to @emollick, Claude Opus 4.8 wrote a paper from archived research and GPT-5.5 Pro reviewed it, finding one major error that Opus then fixed.

Source
2026-05-28
16:57
Claude Opus 4.8 Debuts with Longer Autonomy

According to @AnthropicAI, Claude Opus 4.8 improves judgment, transparency, and sustained autonomous work, and is available now at the same price.

Source
2026-05-19
08:04
Claude Opus 4.7 Regression Sparks Dev Backlash

According to @godofprompt, Opus 4.7 ignores project instructions and skips MCP configs; Anthropic acknowledged regressions versus 4.6 despite higher benchmark scores.

Source
2026-05-09
22:15
Claude Opus 4.7 Boosts SWE-bench to 87.6%

According to @godofprompt, Claude Opus 4.7 follows instructions literally, raises its SWE-bench score from 80.8% to 87.6%, and breaks prompts tuned for 4.6.

Source
2026-05-08
17:13
DeepSeek V4 Powers Claude Code Integration, 97% Cheaper

According to God of Prompt, DeepSeek V4 can natively power Claude Code and costs $0.14 per million tokens versus $5.00 for Claude Opus 4.7.
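As a quick check on the "97% cheaper" headline, the sketch below recomputes the saving from the two prices quoted above. It assumes both figures are prices in the same unit (USD per million tokens of the same type); the item does not say whether they refer to input or output tokens, and `percent_cheaper` is a hypothetical helper written only for this illustration.

```python
def percent_cheaper(cheap_price: float, expensive_price: float) -> float:
    """Return how much cheaper `cheap_price` is than `expensive_price`, in percent.

    Hypothetical helper: assumes both prices use the same unit
    (here, USD per million tokens of the same token type).
    """
    if expensive_price <= 0:
        raise ValueError("expensive_price must be positive")
    return (1 - cheap_price / expensive_price) * 100


# Prices as quoted in the item above, in USD per million tokens.
# Assumption: both refer to the same token type (input vs. output is not stated).
deepseek_v4 = 0.14
claude_opus_4_7 = 5.00

savings = percent_cheaper(deepseek_v4, claude_opus_4_7)
print(f"{savings:.1f}% cheaper")  # → 97.2% cheaper
```

At the quoted prices the saving works out to about 97.2%, consistent with the headline's rounded figure; actual costs for a given workload will depend on its mix of input and output tokens.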

Source
2026-05-06
16:45
Anthropic Boosts Claude Opus API Limits After xAI Deal

According to @SawyerMerritt, Anthropic raised Claude Opus API limits as xAI’s Colossus 1 adds over 300 MW of capacity, expanding enterprise throughput.

Source
2026-04-29
16:08
Claude Opus 4.7 Supercharges Genspark Build

According to @godofprompt, Genspark Build uses Claude Opus 4.7 to turn ideas into websites, apps, and code, enabling rapid product testing at startup speed.

Source
2026-04-24
17:24
Anthropic Study: Claude Opus Outperforms Haiku in AI Agent Negotiations — Analysis and Business Implications

According to AnthropicAI on Twitter, simulated negotiations between Claude Opus and Claude Haiku agents showed Opus consistently securing substantially better deals, while human survey participants failed to perceive the gap, as reported by Anthropic’s post and study snippet. According to Anthropic, the result underscores how higher‑capability LLMs can translate model quality into tangible economic outcomes in automated bargaining and procurement workflows. As reported by Anthropic, this perception gap creates operational risks for enterprises that evaluate agent performance by intuition rather than outcome metrics, suggesting demand for rigorous A/B testing, revealable logs, and controllable negotiation policies in agentic systems. According to Anthropic, organizations deploying multi‑agent systems for sourcing, ad bidding, or dynamic pricing can realize measurable ROI by upgrading from lighter models to stronger models like Opus where negotiation or strategic reasoning is core.

Source
2026-04-23
18:16
OpenAI launches GPT 5.5: Benchmark gains over Claude Opus 4.7, GPT‑5.4‑class speed, and lower coding costs

According to The Rundown AI, OpenAI released GPT 5.5 with benchmark results showing it outperforming Claude Opus 4.7 in coding, reasoning, and math, while matching GPT‑5.4 speed at roughly half the cost of competing frontier coding models. As reported by The Rundown AI, these gains signal a renewed performance lead for OpenAI in developer-focused tasks, suggesting immediate business opportunities in code-generation tooling, agentic workflows, and LLM-powered test automation where lower inference cost and faster latency materially reduce unit economics.

Source
2026-04-21
17:12
Google Deep Research Max Breakthrough: 85.9% BrowseComp Score, Gemini 3.1 Pro, $2–$5 Reports, and MCP Integrations – 2026 Analysis

According to The Rundown AI, Google released an autonomous research agent, Deep Research Max, that achieved 85.9% on BrowseComp, a benchmark for locating hard-to-find facts online, outperforming GPT-5.4 at 58.9% and Claude Opus 4.6 at 45.1%. As reported by The Rundown AI, Deep Research Max is powered by Gemini 3.1 Pro, designed to run overnight, and costs roughly $2–$5 per due diligence report, addressing enterprise-scale research workflows. According to The Rundown AI citing Google’s launch blog, enterprises can schedule a nightly cron job to generate exhaustive due diligence reports by morning, signaling a shift toward automated research operations. As reported by The Rundown AI, FactSet, S&P, and PitchBook are building MCP servers so the agent can plug directly into premium financial data, creating opportunities for investment research, private markets analysis, and risk intelligence.

Source
2026-04-21
03:26
Kimi K2.6 Open-Weights Model vs Claude Opus 4.6: Latest Benchmark Analysis, Real-World Gaps, and 6 Business Takeaways

According to Artificial Analysis, Kimi K2.6 ranks #4 on the Artificial Analysis Intelligence Index with a score of 54, trailing Anthropic, Google, and OpenAI at 57, and posts an Elo of 1520 on GDPval-AA agentic tasks using the Stirrup harness with tools like code execution and web browsing (source: Artificial Analysis thread referenced by Ethan Mollick on X). According to Artificial Analysis, K2.6 maintains a 96% score on τ²-Bench Telecom for tool use and supports multimodal image and video inputs with 256k context, while exposing open weights via first-party and third-party APIs including Novita, Baseten, Fireworks, and Parasail (source: Artificial Analysis). According to Artificial Analysis, K2.6’s hallucination behavior is reported as low and comparable to Claude Opus 4.7 and MiniMax-M2.7 on the AA-Omniscience Index, with token consumption of ~160M reasoning tokens for the full Index run versus ~190M for Claude Sonnet 4.6 and ~110M for GPT 5.4 (source: Artificial Analysis). According to Ethan Mollick citing Artificial Analysis, user feedback notes that despite benchmark wins, open-weights models like Kimi can underperform in real-world usage compared with closed models such as Claude Opus 4.6, underscoring a benchmark-to-production gap (source: Ethan Mollick on X). Business implications: teams can pilot Kimi K2.6 for agentic workflows and tool-use heavy tasks given its open weights and third-party hosting, but should validate with task-specific evals and track token costs; competitive positioning suggests Anthropic and OpenAI remain top for general reliability while Kimi expands open-weights options for procurement and vendor diversification (sources: Artificial Analysis; Ethan Mollick).

Source
2026-04-18
00:56
GDPval-AA Benchmark Criticized: Ethan Mollick Challenges Gemini 3.1 Judging Method in Artificial Analysis Index

According to @emollick, GDPval-AA is not a meaningful benchmark because it uses Gemini 3.1 to judge model outputs on public GDPval questions, which he argues adds little signal about true capability. As reported by Artificial Analysis, Claude Opus 4.7 leads GDPval-AA with 1,753 Elo and tops the Artificial Analysis Intelligence Index at 57.3, narrowly ahead of Gemini 3.1 Pro at 57.2 and GPT-5.4 at 56.8; the firm states GDPval-AA spans 44 occupations and 9 industries using an agentic loop with shell and browsing via the Stirrup harness. According to Artificial Analysis, Opus 4.7 improves on IFBench (+5.5 p.p.), TerminalBench Hard (+5.3 p.p.), HLE (+2.9 p.p.), SciCode (+2.6 p.p.), and GPQA Diamond (+1.8 p.p.), while reducing hallucinations to 36% and using ~35% fewer output tokens than Opus 4.6 to run the suite. For businesses, the dispute over GDPval-AA’s evaluator design highlights the need to diversify benchmarks (e.g., HLE, GPQA Diamond, TerminalBench, AA-Omniscience) and to audit judge-model dependence to avoid evaluator bias and overfitting, as indicated by both Ethan Mollick’s critique and Artificial Analysis’ published methodology.

Source
2026-04-17
16:25
Claude Design Launch: Anthropic’s Opus 4.7 Auto‑Generates UI from Prompts — First Look and Business Impact

According to The Rundown AI on X, Anthropic has launched Claude Design, a generative UI tool where users describe an interface and Claude Opus 4.7 produces a first version that can be refined via inline comments and direct edits; the debut follows reports that Anthropic exec Mike Krieger left Figma’s board amid a competing product launch (as reported by The Rundown AI). According to The Rundown AI, this positions Anthropic to compete in rapid product design and prototyping by collapsing idea-to-mockup cycles and could reduce reliance on traditional design workflows for early-stage iterations. For product teams and startups, the opportunity is faster A/B testing, instant design variations, and lower design costs, while enterprise buyers may seek governance features and version control to integrate Claude Design into existing design ops, according to The Rundown AI.

Source
2026-04-17
01:56
Claude Opus 4.7 Adaptive Thinking Criticism Spurs Fixes: Latest Analysis on Anthropic’s Response and Business Impact

According to Ethan Mollick on X, Anthropic is exploring fixes to Claude Opus 4.7’s adaptive thinking behavior after users reported degraded results on non-math and non-code tasks, caused by an automatic effort router that offers no manual override (as reported in Mollick’s thread and a reply from a Claude product manager). According to Mollick, the model often classifies general writing or reasoning prompts as low effort, producing lower-quality outputs than when users can force higher-effort reasoning, as they can in ChatGPT. According to the public exchange on X, Anthropic’s acknowledgement suggests product adjustments may follow, which could improve reliability for enterprise knowledge work, marketing content, and analyst workflows that depend on consistently high-effort reasoning. As reported in Mollick’s post, adding a manual override or better routing thresholds would reduce task-triage failures, which could lower re-run costs, improve prompt trust, and increase adoption in professional settings that need predictable control over reasoning depth.

Source
2026-04-16
20:47
Claude Opus 4.7 Shows Breakthrough TikZ Drawing Skills: Best ‘Sparks of AGI’ Unicorn Yet

According to Ethan Mollick on Twitter, Anthropic’s Claude Opus 4.7 now generates the strongest TikZ-based “Sparks unicorn” to date, outperforming prior attempts even without deliberate chain-of-thought, and performing exceptionally when it does reason (source: Ethan Mollick, Twitter, Apr 16, 2026). As reported by Mollick, the unicorn is rendered in TikZ—a LaTeX diagram language not intended for free-form artwork—mirroring the original Sparks of AGI evaluation where a model’s ability to draw a primitive unicorn signaled emergent capabilities (source: Ethan Mollick, Twitter; Microsoft Research, “Sparks of Artificial General Intelligence,” 2023). According to Microsoft Research, the unicorn task probes compositional reasoning and programmatic graphics generation, which are relevant for enterprise automation of technical documentation, scientific figures, and reproducible visualization workflows in LaTeX (source: Microsoft Research, 2023). For businesses, improved TikZ code synthesis suggests near-term productivity gains in scientific publishing, data-heavy reports, and developer tooling where LLMs convert natural language into maintainable vector-graphic code, reducing designer handoff time and enabling version-controlled diagrams at scale (source: Ethan Mollick, Twitter; Microsoft Research, 2023).

Source
2026-04-16
19:45
Claude Opus 4.7 Adaptive Thinking Criticized: User Reports Lower Quality on Non‑Technical Tasks – Analysis and Business Implications

According to Ethan Mollick on Twitter, Claude Opus 4.7’s adaptive thinking requirement often misclassifies non‑math and non‑code prompts as low effort, yielding worse results compared to tasks it deems high effort, and lacks a manual override similar to ChatGPT’s controls (as reported by Ethan Mollick, Apr 16, 2026). According to Mollick’s post, the absence of a user-selectable effort mode limits control over reasoning depth, potentially degrading outputs for writing, strategy, and qualitative analysis. From an AI product perspective, this suggests opportunities for providers to add explicit effort controls, per‑task reasoning budgets, and transparent routing indicators; vendors serving enterprise content, marketing, and consulting workflows could differentiate with tunable reasoning settings and audit logs for model routing decisions, according to the same source.

Source